Data Wrangling

In this section, I will assess the quality and tidiness of the data and then clean it.

Assess Data

There are no missing values in the data.

Assessing

  1. There are no missing values.
  2. There are 534 duplicate records in the dataset.
  3. The Class and Category variables are of the object data type.
  4. Some values in the Category variable carry a trailing sample identifier (a long hexadecimal string and a file suffix), for example -00a2c6bab1e53f679cdd4fdc772cd291928c109b9b747652639a1700d844f719-1.raw.
  5. The Category variable has a structural issue: the category of attack and the type of attack are stored in the same column, for example Spyware-Gator and Ransomware-Ako.
  6. All the values in the pslist.nprocs64bit, svcscan.interactive_process_services and handles.nport variables are zeros.
  7. Possible outliers in the pslist.nppid, pslist.avg_handlers, handles.avg_handles_per_proc, handles.nfile, handles.ndesktop, handles.nkey, handles.nthread, handles.nsemaphore, handles.nsection, malfind.ninjections, malfind.commitCharge, malfind.protection, malfind.uniqueInjections, psxview.not_in_pslist, psxview.not_in_ethread_pool, psxview.not_in_pspcid_list, psxview.not_in_csrss_handles, psxview.not_in_session and psxview.not_in_deskthrd variables.
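
The checks behind the list above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the notebook's original code: the rows, the shortened identifiers in Category and the variable names (missing_per_column, n_duplicates and so on) are all assumptions.

```python
import pandas as pd

# Tiny stand-in frame with made-up values; the real notebook would load the full dataset.
df = pd.DataFrame({
    "Category": ["Benign", "Spyware-Gator-ab12-1.raw", "Benign", "Ransomware-Ako-cd34-2.raw"],
    "Class": ["Benign", "Malware", "Benign", "Malware"],
    "pslist.nproc": [45, 38, 45, 40],
    "handles.nport": [0, 0, 0, 0],
})

# Missing values per column.
missing_per_column = df.isna().sum()

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(df.duplicated().sum())

# Columns that are not numeric (candidates for the categorical data type).
non_numeric_columns = df.select_dtypes(exclude="number").columns.tolist()

# Numeric columns whose values are all zero.
numeric = df.select_dtypes(include="number")
all_zero_columns = [col for col in numeric.columns if (numeric[col] == 0).all()]

print(missing_per_column, n_duplicates, non_numeric_columns, all_zero_columns, sep="\n")
```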

Cleaning

  1. Remove duplicate records.
  2. Remove strange characters in the Category variable.
  3. Fix the structural issue in the Category column by creating a new column called type.
  4. Convert the Class and Category to categorical data type.
  5. Remove the pslist.nprocs64bit, svcscan.interactive_process_services and handles.nport variables.

Data Cleaning

Remove duplicate records
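
A minimal sketch of this step, assuming the data sits in a pandas DataFrame called df (the frame below is a made-up stand-in):

```python
import pandas as pd

# Made-up stand-in: the first and third rows are identical.
df = pd.DataFrame({
    "Class": ["Benign", "Malware", "Benign"],
    "pslist.nproc": [45, 38, 45],
})

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(len(df), "->", len(deduped))
```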

Remove strange characters from the Category column

There are 28346 unique values in the Category column. Let's fix this.
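
One way to strip the trailing identifier is to keep only the first two hyphen-separated parts of each value. The sketch below assumes every value looks like Category-Type-identifier-N.raw (or just Benign); the sample values and the helper name strip_sample_suffix are hypothetical.

```python
import pandas as pd

# Made-up values in the assumed shape "Category-Type-<identifier>-<n>.raw".
raw = pd.Series([
    "Benign",
    "Spyware-Gator-1f3a9c-1.raw",
    "Ransomware-Ako-00a2c6-1.raw",
    "Spyware-180solutions-9be2-10.raw",
])

def strip_sample_suffix(value: str) -> str:
    """Keep only the leading 'Category-Type' part of a Category value."""
    return "-".join(value.split("-")[:2])

cleaned = raw.map(strip_sample_suffix)
print(cleaned.tolist())
print(cleaned.nunique())
```

If the real values deviate from that shape (for example, a type name that itself contains a hyphen), the split rule would need adjusting.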

Create a new column for type of attack
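
A minimal sketch of the split, assuming the Category values have already been cleaned to the form Category-Type. The column names category_new and type follow the names used in this notebook; labelling benign rows with the type "Benign" is an assumption made here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Category": ["Benign", "Spyware-Gator", "Ransomware-Ako", "Trojan-Zeus"]})

# Split once on the first hyphen: part 0 is the attack category, part 1 the attack type.
parts = df["Category"].str.split("-", n=1, expand=True)
df["category_new"] = parts[0]

# Benign rows have no hyphen, so their type is missing; fill it with a label (assumption).
df["type"] = parts[1].fillna("Benign")

print(df)
```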

Convert the Class and Category to categorical data type.
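
As a sketch (on a made-up three-row frame), the conversion can be done with astype:

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["Benign", "Malware", "Malware"],
    "Category": ["Benign", "Spyware", "Trojan"],
})

# Convert both columns from generic strings to the pandas categorical data type.
df = df.astype({"Class": "category", "Category": "category"})

print(df.dtypes)
```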

Remove the pslist.nprocs64bit, svcscan.interactive_process_services and handles.nport variables.
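
A minimal sketch of dropping the all-zero variables, using a made-up frame that contains them next to one informative column:

```python
import pandas as pd

# Made-up stand-in frame.
df = pd.DataFrame({
    "pslist.nproc": [45, 38, 40],
    "pslist.nprocs64bit": [0, 0, 0],
    "svcscan.interactive_process_services": [0, 0, 0],
    "handles.nport": [0, 0, 0],
})

zero_columns = [
    "pslist.nprocs64bit",
    "svcscan.interactive_process_services",
    "handles.nport",
]

# A column that is constant carries no information for a classifier.
df = df.drop(columns=zero_columns)

print(df.columns.tolist())
```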

Descriptive Analysis

Summary Statistics

What is the Proportion of the Class Variable?

49.7% of the records in the dataset are malware attacks, while 50.3% are benign connections. The classes in the dataset are balanced.
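
These proportions can be reproduced from the record counts reported in this notebook. The sketch below rebuilds a Class column from 29,231 benign records and 28,831 malware records, where the malware total is my own sum of the spyware, ransomware and trojan counts (9,815 + 9,529 + 9,487); the variable names are assumptions.

```python
import pandas as pd

# Rebuild the Class column from the reported counts rather than loading the dataset.
labels = pd.Series(["Benign"] * 29231 + ["Malware"] * 28831, name="Class")

# Share of each class as a percentage, rounded to one decimal place.
proportions = (labels.value_counts(normalize=True) * 100).round(1)

print(proportions)
```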

What is the proportion of the category_new variable?

There are 29,231 records of benign connections, 9,815 records of spyware attacks, 9,529 records of ransomware attacks and 9,487 records of trojan attacks.

What are the different types of ransomware attacks in the dataset?

There are five types of ransomware attack in the dataset: shade, ako, conti, maze, pysa.

There are 2128 shade ransomware attacks, 2000 ako ransomware attacks, 1988 conti ransomware attacks, 1754 maze ransomware attacks and 1659 pysa ransomware attacks.

What are the different types of spyware attacks in the dataset?

There are five types of spyware attack in the dataset: Transponder, 180solutions, CWS, Gator, TIBS.

There are 2410 Transponder spyware attacks, 2000 180solutions spyware attacks, 2000 CWS spyware attacks, 1995 Gator spyware attacks and 1410 TIBS spyware attacks.

What are the different types of trojan attacks in the dataset?

There are five types of trojan attack in the dataset: Refroso, Scar, Emotet, Zeus, Reconyc.

There are 2000 Refroso trojan attacks, 2000 Scar trojan attacks, 1967 Emotet trojan attacks, 1950 Zeus trojan attacks and 1570 Reconyc trojan attacks.

Histograms: Check Data Distribution of some Variables

Distribution of the 'pslist.nproc' variable

Distribution of the 'svcscan.shared_process_services' variable

The svcscan.shared_process_services variable is left-skewed.

Distribution of the 'svcscan.kernel_drivers' variable

The svcscan.kernel_drivers variable is left-skewed.

Distribution of the 'handles.nmutant' variable

Most of the values fall within the range of 250-300.

Distribution of the 'handles.nevent' variable

Distribution of the 'malfind.ninjections' variable

The malfind.ninjections variable is right-skewed.
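
The direction of skew read off a histogram can be cross-checked with the sample skewness, which is positive for a right-skewed variable and negative for a left-skewed one. The sketch below uses two small made-up series, not values from the dataset.

```python
import pandas as pd

# Made-up sample with a long right tail (one large value).
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 12])

# Mirroring the sample flips the tail to the left.
left_tailed = right_tailed.max() - right_tailed

right_skew = right_tailed.skew()
left_skew = left_tailed.skew()

print(right_skew, left_skew)
```

On the real data the same call would be applied to a column, for example df["malfind.ninjections"].skew().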

Diagnostic Analysis

In this section, I will take a deep dive into the data to extract some insights from it.

Why are some memory records malware attacks?

Analysis of the svcscan.kernel_drivers variable

The plot above on the left is a box plot, while the plot on the right is a violin plot. For malware attacks, the svcscan.kernel_drivers variable has values less than 200. If the value of the svcscan.kernel_drivers variable is below 200, there is a high chance that the record is a malware attack.
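
A threshold rule like this one can be checked numerically by looking at which classes fall below the cut-off. The sketch below uses made-up values for svcscan.kernel_drivers, so the resulting share is illustrative only; the names threshold, below and malware_share_below are assumptions.

```python
import pandas as pd

# Made-up records: four benign and four malware.
df = pd.DataFrame({
    "Class": ["Benign"] * 4 + ["Malware"] * 4,
    "svcscan.kernel_drivers": [221, 222, 221, 195, 180, 190, 185, 221],
})

threshold = 200

# Records below the threshold, and the fraction of them that are malware.
below = df[df["svcscan.kernel_drivers"] < threshold]
malware_share_below = (below["Class"] == "Malware").mean()

print(len(below), malware_share_below)
```

The same check applies to the other per-variable thresholds in this section by changing the column name and the cut-off.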

Analysis of the svcscan.nservices variable

For malware attacks, the svcscan.nservices variable has values less than 350. If the value of the svcscan.nservices variable is below 350, there is a high chance that the record is a malware attack.

Analysis of the svcscan.shared_process_services variable

For malware attacks, the svcscan.shared_process_services variable has values less than 100. If the value of the svcscan.shared_process_services variable is below 100, there is a high chance that the record is a malware attack.

Analysis of the handles.nevent variable

For the variable handles.nevent, most of the malware attacks have a value of 3000 as indicated in the violin plot above. Most benign records are within the range of 3500-5000. Values below 3000 could be malicious attacks.

Analysis of the 'dlllist.avg_dlls_per_proc' variable

For the variable dlllist.avg_dlls_per_proc, most of the malware attacks have values ranging from 35 - 40 as indicated in the violin plot above. Most benign records are above the value 40. If the value of dlllist.avg_dlls_per_proc is below 40, there is a high chance that the record is a malicious attack.

Analysis of the 'dlllist.ndlls' variable

For the variable dlllist.ndlls, most of the malware attacks have values ranging from 1400 - 1700 as indicated in the violin plot above. Most benign records are within the range of 2000-2300. If the value of dlllist.ndlls is below 1500, there is a high chance that the record is a malicious attack.

Analysis of the 'handles.nkey' variable

Analysis of the pslist.avg_handlers variable

Analysis of the handles.avg_handles_per_proc variable

Most benign values of the handles.avg_handles_per_proc variable are within the range of 200-350. If the value is below 200 or above 350, it could be a malware attack.

Correlation Analysis

There is a high positive correlation between svcscan.nservices and svcscan.kernel_drivers. Only the malware points are visible in the plot.
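
The strength of such a relationship can be quantified with the Pearson correlation coefficient. The sketch below uses five made-up value pairs, so the coefficient it prints is illustrative and not the one from the dataset.

```python
import pandas as pd

# Made-up values that rise together.
df = pd.DataFrame({
    "svcscan.nservices": [380, 389, 392, 395, 400],
    "svcscan.kernel_drivers": [215, 221, 222, 223, 225],
})

# Pearson correlation between the two columns (the default method of Series.corr).
r = df["svcscan.nservices"].corr(df["svcscan.kernel_drivers"])

print(round(r, 3))
```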

Plot Matrices

There is a high correlation between malfind.ninjections and malfind.protection.

Relationship between svcscan.process_services and svcscan.nactive variables

Most of the benign records have svcscan.process_services values greater than 22.5 and svcscan.nactive values greater than 110. Most of the ransomware attacks fall within 17.5 - 25 for svcscan.process_services and 90 - 120 for svcscan.nactive. The clusters are easy to identify in the scatter plot.

Relationship between svcscan.shared_process_services and svcscan.process_services

From the scatter plot between the svcscan.shared_process_services and svcscan.process_services variables, we see that records with svcscan.process_services below 24 are malware attacks, while records above 24 are benign. The benign records also have svcscan.shared_process_services above 110.

Relationship between handles.nkey and svcscan.process_services

Most of the benign records have svcscan.process_services above 24 and handles.nkey below 1500.

Relationship between handles.nkey and svcscan.shared_process_services

Records with svcscan.shared_process_services above 118 and handles.nkey below 700 are benign, as are records with svcscan.shared_process_services above 118 and handles.nkey above 1200. Most of the records with svcscan.shared_process_services below 118 are malware attacks.

Relationship between dlllist.avg_dlls_per_proc and svcscan.kernel_drivers

Most of the benign records have dlllist.avg_dlls_per_proc above 45 and svcscan.kernel_drivers above 200.

Relationship between handles.nevent and handles.nkey

The benign records have handles.nevent above 3500 and handles.nkey in the range of 600 - 1600. Malware attacks have handles.nevent below 4000.

Most of the benign records have handles.ndirectory below 200 and handles.nevent above 4000. All the records with handles.ndirectory above 200 belong to the malware class.

Relationship between ldrmodules.not_in_load_avg and dlllist.avg_dlls_per_proc variables

Most of the records in the benign class have dlllist.avg_dlls_per_proc above 40 and ldrmodules.not_in_load_avg below 0.1. All records with ldrmodules.not_in_load_avg above 0.1 are malware attacks, as are records with dlllist.avg_dlls_per_proc below 30.

Relationship between handles.nevent and handles.nmutant

The benign class has handles.nevent greater than 3000 and handles.nmutant greater than 300. All the records with handles.nevent greater than 4500 are benign, and all records with handles.nevent less than 3000 are malware. There is a small cluster of the malware class with handles.nmutant between 500 - 600 and handles.nevent between 2500 - 3500.

Correlation Heatmap

There is a high positive correlation between:

Predictive Analysis

In this section, I will train a linear classifier and a non-linear classifier to detect malicious and benign connections.

Experiment 1: Train Models with all the features

Label Encoding

Convert the 'Benign' class to 0 and the 'Malware' class to 1
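
A minimal sketch of this encoding with an explicit mapping, which keeps the 0/1 assignment visible (the sample labels are made up):

```python
import pandas as pd

labels = pd.Series(["Benign", "Malware", "Malware", "Benign"], name="Class")

# Explicit mapping so that Benign is always 0 and Malware is always 1.
encoded = labels.map({"Benign": 0, "Malware": 1})

print(encoded.tolist())
```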

Separate the features and target variable

The target variable is the Class variable.

Split Data into the training set and test set

70% of the data will be used for training the models, while 30% will be used to evaluate them.
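
The separation and the 70/30 split can be sketched with scikit-learn's train_test_split. The frame below is randomly generated; stratifying on the target and fixing random_state are assumptions added for reproducibility, not settings confirmed by this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in with two features and an encoded Class column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pslist.nproc": rng.integers(30, 60, size=100),
    "handles.nevent": rng.integers(2500, 5000, size=100),
    "Class": [0, 1] * 50,
})

# Features are every column except the target.
X = df.drop(columns=["Class"])
y = df["Class"]

# 70% for training, 30% for evaluation; stratify keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```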

Feature Scaling

Apply normalization to the features to keep them on the same scale.
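
A minimal sketch, assuming "normalization" here means min-max scaling to the range 0 to 1 (MinMaxScaler is my assumption; the notebook does not name the scaler). The scaler is fitted on the training set only and then reused on the test set, so no information from the test set leaks into training.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrices.
X_train = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
X_test = np.array([[25.0, 500.0]])

scaler = MinMaxScaler()

# Learn the per-feature minimum and maximum from the training data only.
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same minimum and maximum to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```

Test values that fall outside the training range land outside 0 to 1, as the second feature of the test row does here.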

Define Custom Functions to Evaluate the Models
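
One possible shape for such a helper is sketched below. The function name evaluate_model and the choice of metrics are assumptions; the original functions may report different metrics or also draw plots.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def evaluate_model(y_true, y_pred):
    """Return the main binary classification metrics as a dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

# Made-up labels: one malware record (1) is missed by the predictions.
scores = evaluate_model([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(scores)
```

Calling the same helper on the training and test predictions makes it easy to compare the two and spot overfitting.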

Training Logistic Regression Model

Training a Random Forest Model

This model is overfitting. I will fine-tune the model using GridSearchCV to select the best hyperparameters for the random forest model.
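
A minimal sketch of such a search on synthetic data. The parameter grid, the number of folds and the scoring metric are assumptions chosen to keep the example fast; they are not the values used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Limiting tree depth and raising the minimum leaf size both restrain overfitting.
param_grid = {
    "max_depth": [3, 5],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds a model refitted on all of the supplied data with the best parameter combination.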

Display the Top 10 Important Features
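
A fitted random forest exposes impurity-based importances through its feature_importances_ attribute. The sketch below ranks them on synthetic data; the feature names feature_0 to feature_14 are placeholders, not columns of the dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with 15 features.
X, y = make_classification(n_samples=200, n_features=15, n_informative=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(15)]

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Pair each importance with its feature name and keep the ten largest.
importances = pd.Series(model.feature_importances_, index=feature_names)
top10 = importances.sort_values(ascending=False).head(10)

print(top10)
```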

Visualize the Results

Experiment 2: Train Models without Correlated Features

Drop Correlated Features
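
A common way to do this is to scan the upper triangle of the absolute correlation matrix and drop one feature from every pair above a threshold. The sketch below uses made-up values and a threshold of 0.9, both of which are assumptions; which feature of a pair is dropped depends on the column order.

```python
import numpy as np
import pandas as pd

# Made-up values: the first two columns rise together, the third does not.
df = pd.DataFrame({
    "svcscan.nservices": [380, 389, 392, 395, 400],
    "svcscan.kernel_drivers": [215, 221, 222, 223, 225],
    "handles.nmutant": [280, 260, 300, 255, 275],
})

corr = df.corr().abs()

# Keep only the entries above the diagonal so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop every column that is highly correlated with an earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```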

Separate the features and target variable

The target variable is the Class variable.

Split Data into the training set and test set

70% of the data will be used for training the models, while 30% will be used to evaluate them.

Feature Scaling

Apply normalization to the features to keep them on the same scale.

Training Logistic Regression Model

Training a Random Forest Model

This model is overfitting. I will fine-tune the model using GridSearchCV to select the best hyperparameters for the random forest model.

Display the Top 10 Important Features

Visualize the Results